Dev chunk optimization postprocessveppanel #390
Conversation
…. Upgrade pybedtools. Added wave
…olving conflicts.
…ostprocessing_annotation.py. panel_postprocessing_chunk_size deleted.
…nd within-chrom chunking optional
Hi! While checking the cord bloods run (combining DupCaller and deepUMI calls) I saw that one of the places where we have a bottleneck is in the
Looks good Miguel!
I left some comments and suggestions, nothing critical:
- only some minor fixes to pass the Nextflow linting
- update the default chunk_size to 1M so that bigger panels get chunked
Another comment: we may need to be more generous with the memory of some steps when running bigger cohorts, but we will see this as we start using it.
If you agree, I would apply the suggestions and then merge it to dev so that it starts to get tested by all of us and we can tune it from there.
Thanks!!
label 'process_single'
conda "python=3.10.17 bioconda::pybedtools=0.12.0 conda-forge::polars=1.30.0 conda-forge::click=8.2.1 conda-forge::gcc_linux-64=15.1.0 conda-forge::gxx_linux-64=15.1.0"
container 'docker://bbglab/deepcsa_bed:latest'
I think the recipe for this container is not pushed to https://github.com/bbglab/containers-recipes
If you have it somewhere locally, try to push it so that we have everything centralized there, but go ahead with the merge
Wait Miguel, we found some weird behaviour in the test run with the cord bloods. I will let you know once we solve it.
…ove chunking logic
…tion. Update omega snapshot with 2 decimals
…oved processing. Resources adjusted
…dd error handling
…g in VEP annotation and adjust related schema defaults. Linting fixes
- Modified nextflow.config to include general reference paths and skip validation for specific parameters.
- Increased resource limits for processes to accommodate VEP execution.
- Changed panel_sites_chunk_size to 0 and disabled parameter validation.
- Added new input_maf.csv file with sample and VCF path data for testing.
MAF tests from current dev added in 330690e. Tests passed; the new snapshots match the ones from dev.
Merge FROM dev done and tests passed. We can go ahead with the merge to dev.

[copilot generated]
Performance Optimization: Chunked Processing for Large Panel Annotations
Overview
This PR introduces memory-efficient chunked processing for VEP annotation post-processing, enabling the pipeline to handle arbitrarily large panel annotations without memory constraints.
Changes Summary
✅ Implemented Chunking Optimizations
1. panel_postprocessing_annotation.py - Chunked VEP Output Processing
Technical details:
Process: CREATEPANELS:POSTPROCESSVEPPANEL, VCFANNOTATEPANEL
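As a rough illustration of the pattern (not the actual script: the file layout, the per-row transform, and the chunk size are assumptions for the sketch), VEP output can be postprocessed in fixed-size row chunks so that memory stays bounded regardless of panel size:

```python
# Sketch of memory-bounded, row-chunked postprocessing of a large
# tab-separated annotation file. Only one chunk of rows is ever held
# in memory; names and the transform below are illustrative.

CHUNK_SIZE = 1_000_000  # rows per chunk (cf. the 1M default discussed above)

def process_rows(rows):
    # Placeholder transform: keep rows whose third column is non-empty.
    return [r for r in rows if len(r) > 2 and r[2]]

def postprocess_chunked(in_path, out_path, chunk_size=CHUNK_SIZE):
    chunk = []
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            chunk.append(line.rstrip("\n").split("\t"))
            if len(chunk) >= chunk_size:
                for row in process_rows(chunk):
                    fout.write("\t".join(row) + "\n")
                chunk.clear()  # only one chunk is resident at a time
        for row in process_rows(chunk):  # flush the final partial chunk
            fout.write("\t".join(row) + "\n")
```

The peak memory is proportional to chunk_size rather than to the total panel size, which is what lets arbitrarily large panels pass through.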
2. panel_custom_processing.py - Chromosome-Based Chunked Loading
Technical details:
Process: CUSTOMPROCESSING / CUSTOMPROCESSINGRICH

❌ VEP Cache Storage Location - No Performance Impact
What was tested:
- Reading the VEP cache from /workspace/datasets/vep or /data/bbg/datasets/vep
Results:
- No runtime benefit for the ENSEMBLVEP_VEP process
Commits:
- 035a0c7 (April 3, 2025): Added VEP cache beegfs support
- 8e40d83 (April 24, 2025): Removed VEP cache beegfs optimization (no benefit)
Current approach:
- VEP cache location taken from params.vep_cache
Resource Configuration
Updated resource limits for chunked processes:
Integration Points
Affected Subworkflows:
- CREATEPANELS → POSTPROCESSVEPPANEL: processes VEP output in chunks
- CUSTOMPROCESSING / CUSTOMPROCESSINGRICH: uses chunked loading for custom regions
Pipeline Flow:
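A minimal sketch of the chromosome-based chunked loading idea (assumes the input file is sorted by chromosome; the column layout and the per-chunk work are hypothetical, not the pipeline's actual code):

```python
# Stream a chromosome-sorted, BED-like TSV and yield one chromosome's
# records at a time, so only a single chromosome is resident in memory.
from itertools import groupby

def iter_chrom_chunks(path):
    """Yield (chrom, rows) pairs from a TSV whose first column is the
    chromosome; assumes the file is already sorted by that column."""
    with open(path) as fh:
        rows = (line.rstrip("\n").split("\t") for line in fh)
        for chrom, grp in groupby(rows, key=lambda r: r[0]):
            yield chrom, list(grp)

def regions_per_chromosome(path):
    # Placeholder per-chunk work: count regions per chromosome.
    return {chrom: len(rows) for chrom, rows in iter_chrom_chunks(path)}
```

Grouping by chromosome gives natural chunk boundaries, so no region is ever split across two chunks and downstream per-chromosome operations see complete data.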
Testing
Tested on:
Validation:
Performance Impact
Migration Notes
No breaking changes. Existing pipelines continue to work with improved memory efficiency.
Related Commits
- 276152d: Chunking for panel_custom_processing.py
- 035a0c7: VEP cache beegfs attempt (added)
- 8e40d83: VEP cache beegfs removal (no performance gain)
- 1dffd94, 945c129, d243ebc, etc.: resource tuning
Conclusion
This PR successfully implements memory-efficient chunked processing for panel annotation post-processing, enabling the pipeline to scale to arbitrarily large panels without memory constraints. The VEP cache storage location experiment confirmed that computation, not I/O, is the bottleneck for annotation runtime.